Sampling Algorithms and Coresets for ℓp Regression

Authors

  • ANIRBAN DASGUPTA
  • MICHAEL W. MAHONEY
Abstract

The ℓp regression problem takes as input a matrix A ∈ R^{n×d}, a vector b ∈ R^n, and a number p ∈ [1,∞), and it returns as output a number Z and a vector x_opt ∈ R^d such that Z = min_{x∈R^d} ‖Ax − b‖_p = ‖Ax_opt − b‖_p. In this paper, we construct coresets and obtain an efficient two-stage sampling-based approximation algorithm for the very overconstrained (n ≫ d) version of this classical problem, for all p ∈ [1,∞). The first stage of our algorithm nonuniformly samples r̂₁ = O(36^p d^{max{p/2+1, p}+1}) rows of A and the corresponding elements of b, and then it solves the ℓp regression problem on the sample; we prove this is an 8-approximation. The second stage of our algorithm uses the output of the first stage to resample r̂₁/ε² constraints, and then it solves the ℓp regression problem on the new sample; we prove this is a (1 + ε)-approximation. Our algorithm unifies, improves upon, and extends the existing algorithms for special cases of ℓp regression, namely, p = 1, 2 [K. L. Clarkson, in Proceedings of the 16th Annual ACM–SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 2005, pp. 257–266; P. Drineas, M. W. Mahoney, and S. Muthukrishnan, in Proceedings of the 17th Annual ACM–SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 2006, pp. 1127–1136]. In the course of proving our result, we develop two concepts, well-conditioned bases and subspace-preserving sampling, that are of independent interest.
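
The following is a minimal, illustrative sketch of the two-stage sampling pattern described above, not the paper's exact construction: the well-conditioned-basis sampling scores are approximated here, as a simplifying assumption, by row norms of an orthonormal basis of [A | b], a generic convex solver stands in for an exact ℓp solver, and the function and parameter names are invented for this sketch.

```python
import numpy as np
from scipy.optimize import minimize


def solve_lp_regression(A, b, p):
    """Solve min_x ||Ax - b||_p with a generic convex solver (p >= 1),
    warm-started at the least-squares solution."""
    x0, *_ = np.linalg.lstsq(A, b, rcond=None)
    obj = lambda x: np.sum(np.abs(A @ x - b) ** p)
    return minimize(obj, x0, method="L-BFGS-B").x


def sample_and_rescale(A, b, scores, r, p, rng):
    """Keep row i independently with probability p_i = min(1, r * score_i / sum(scores)),
    rescaling kept rows by (1 / p_i)^(1/p) so the sampled objective estimates
    ||Ax - b||_p^p."""
    probs = np.minimum(1.0, r * scores / scores.sum())
    keep = rng.random(len(probs)) < probs
    scale = (1.0 / probs[keep]) ** (1.0 / p)
    return A[keep] * scale[:, None], b[keep] * scale


def two_stage_lp_regression(A, b, p, r1=2000, eps=0.1, seed=0):
    """Two-stage row sampling for very overconstrained l_p regression (n >> d)."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape

    # Stage 1 sampling scores.  The paper uses a well-conditioned basis for
    # the l_p norm; this sketch instead uses the p-th powers of the row norms
    # of an orthonormal basis of [A | b] (a simplifying assumption).
    Q, _ = np.linalg.qr(np.column_stack([A, b]))
    scores1 = np.linalg.norm(Q, axis=1) ** p
    A1, b1 = sample_and_rescale(A, b, scores1, r1, p, rng)
    x1 = solve_lp_regression(A1, b1, p)        # constant-factor approximation

    # Stage 2: resample roughly r1 / eps^2 constraints, letting the residuals
    # of the stage-1 solution also influence the sampling probabilities.
    resid = np.abs(A @ x1 - b) ** p
    scores2 = scores1 / scores1.sum() + resid / resid.sum()
    r2 = min(int(r1 / eps ** 2), n)
    A2, b2 = sample_and_rescale(A, b, scores2, r2, p, rng)
    return solve_lp_regression(A2, b2, p)      # (1 + eps)-approximation in the paper
```

For p = 2 the stage-1 scores above reduce to the familiar statistical leverage scores; note that the paper's first-stage sample size scales as O(36^p d^{max{p/2+1, p}+1}), far larger than the small default used here purely for illustration.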

Similar Articles

Automated Scalable Bayesian Inference via Hilbert Coresets

The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern datasets, and modifications to make them so are typically...

Training Support Vector Machines using Coresets

Note: This work was done as a course project as part of an ongoing research effort that was recently submitted [2]. The submission, done in collaboration with Murad Tukan, Dan Feldman, and Daniela Rus [2], supersedes the work in this manuscript. We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient ...

Coresets Meet EDCS: Algorithms for Matching and Vertex Cover on Massive Graphs

Maximum matching and minimum vertex cover are among the most fundamental graph optimization problems. Recently, randomized composable coresets were introduced as an effective technique for solving these problems in various models of computation on massive graphs. In this technique, one partitions the edges of an input graph randomly into multiple pieces, compresses each piece into a smaller sub...
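
As a rough illustration of the randomized composable coreset pattern mentioned in this abstract, the sketch below randomly partitions the edge set, compresses each piece, and solves on the union of the compressed pieces. The greedy maximal matching used for compression is only a stand-in for the EDCS construction of the paper, and all function names are invented for this sketch.

```python
import random


def greedy_matching(edges):
    """Greedy maximal matching; a simple stand-in for the per-piece
    compression (the paper uses an EDCS, which has stronger guarantees)."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def composable_coreset_matching(edges, k=4, seed=0):
    """Randomly partition edges into k pieces, compress each piece
    independently, then solve on the union of the compressed pieces."""
    rng = random.Random(seed)
    pieces = [[] for _ in range(k)]
    for e in edges:
        pieces[rng.randrange(k)].append(e)                 # random edge partition
    union = [e for piece in pieces for e in greedy_matching(piece)]
    return greedy_matching(union)                          # matching of the coreset union
```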

Core-Preserving Algorithms

We define a class of algorithms for constructing coresets of (geometric) data sets, and show that algorithms in this class can be dynamized efficiently in the insertiononly (data stream) model. As a result, we show that for a set of points in fixed dimensions, additive and multiplicative ε-coresets for the k-center problem can be maintained in O(1) and O(k) time respectively, using a data struc...
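
To make the streaming claim concrete, here is a minimal insertion-only sketch, not the paper's construction: keeping one representative per grid cell of side ε/√dim ensures every inserted point lies within ε of a kept point, so the k-center value of the representatives is within an additive ε of the true value. The bounded-domain assumption and the class name below are ours.

```python
import math
from typing import Dict, Tuple


class AdditiveKCenterCoreset:
    """Insertion-only additive eps-coreset for k-center: one representative
    point per grid cell of side eps / sqrt(dim).  Assumes coordinates lie in
    a bounded domain so the number of occupied cells stays small."""

    def __init__(self, eps: float, dim: int):
        self.cell = eps / math.sqrt(dim)   # cell diameter is then exactly eps
        self.reps: Dict[Tuple[int, ...], Tuple[float, ...]] = {}

    def insert(self, point: Tuple[float, ...]) -> None:
        """O(1) expected time per insertion (one hash of the cell index)."""
        key = tuple(int(math.floor(c / self.cell)) for c in point)
        self.reps.setdefault(key, point)   # keep the first point seen in the cell

    def coreset(self):
        return list(self.reps.values())
```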

Journal title:

Volume   Issue

Pages  -

Publication date: 2009